Lean 4




ProofBridge: Auto-Formalization of Natural Language Proofs in Lean via Joint Embeddings

Jana, Prithwish, Kale, Kaan, Tanriverdi, Ahmet Ege, Song, Cruise, Vishwanath, Sriram, Ganesh, Vijay

arXiv.org Artificial Intelligence

Translating human-written mathematical theorems and proofs from natural language (NL) into formal languages (FLs) like Lean 4 has long been a significant challenge for AI. Most state-of-the-art methods focus either on theorem-only NL-to-FL auto-formalization or on FL proof synthesis from FL theorems. In practice, auto-formalizing both theorem and proof still requires human intervention, as seen in AlphaProof's silver-medal performance at the 2024 IMO, where problem statements were manually translated before automated proof synthesis. ProofBridge maps NL and FL theorems (and their proofs) into a joint embedding space; our training ensures that NL-FL theorems (and their proofs) are mapped close together in this space if and only if the NL-FL pairs are semantically equivalent. Experiments show substantial improvements in proof auto-formalization over strong baselines, including GPT-5 and Gemini-2.5.

In mathematics, ensuring the correctness of proofs is a crucial yet inherently difficult task. Traditionally, mathematicians rely on the peer-review process for proof verification, yet as proofs grow increasingly complex, even careful human scrutiny can overlook subtle errors. For instance, in 1989, Kapranov and Voevodsky published a proof connecting ∞-groupoids and homotopy types, which was later disproven by Carlos Simpson in 1998; more recently, while formalizing his 2023 paper (Tao, 2023) on the Maclaurin-type inequality, Terence Tao discovered a non-trivial bug. To mitigate the challenges of verifying complex proofs, proof assistants and formal mathematical languages like Coq (Barras et al., 1999), Isabelle (Nipkow et al., 2002), HOL Light (Harrison, 2009), Metamath (Megill & Wheeler, 2019), Lean 4 (Moura & Ullrich, 2021), and Peano (Poesia & Goodman, 2023) have been developed, offering a way to create computer-verifiable formal proofs. Such formal language (FL) proofs, defined by strict syntax and symbolic logic, enable reliable automated verification, resolving the inherent ambiguity of natural language (NL) proofs.
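The joint-embedding training described above can be sketched with a contrastive objective: matched NL-FL pairs are pulled together and mismatched pairs pushed apart. This is a minimal sketch assuming an InfoNCE-style loss; the paper's exact objective and architecture are not given here.

```python
import numpy as np

def contrastive_loss(nl_emb, fl_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of NL/FL embedding pairs.

    nl_emb, fl_emb: (batch, dim) arrays where row i of each is a matched
    NL-FL pair. Matched pairs are pulled together in the joint space,
    mismatched (off-diagonal) pairs pushed apart.
    """
    # Normalize so the dot product is cosine similarity.
    nl = nl_emb / np.linalg.norm(nl_emb, axis=1, keepdims=True)
    fl = fl_emb / np.linalg.norm(fl_emb, axis=1, keepdims=True)
    logits = nl @ fl.T / temperature              # (batch, batch) similarities
    # Cross-entropy with the diagonal (true pairs) as targets.
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Under this loss, a batch whose NL and FL embeddings align row-by-row scores lower than one whose pairings are scrambled, which is exactly the "close if and only if semantically equivalent" property the training targets.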


Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing

Moore, Hayden, Shah, Asfahan

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their impressive performance, these models can still struggle to produce grounded and verifiable formalizations. Recent work in text-to-SQL has revealed that LLMs can be sensitive to paraphrased natural language (NL) inputs, even when a high degree of semantic fidelity is preserved (Safarzadeh, Oroojlooyjadid, and Roth 2025). In this paper, we investigate this claim in the autoformalization domain. Specifically, we evaluate the robustness of LLMs generating formal proofs from semantically similar paraphrased NL statements by measuring semantic and compilation validity. Using the formal benchmarks MiniF2F (Zheng, Han, and Polu 2021) and the Lean 4 version of ProofNet (Xin et al. 2024), and two modern LLMs, we generate paraphrased natural language statements and cross-evaluate them across both models. The results reveal performance variability across paraphrased inputs, demonstrating that minor shifts in NL statements can significantly impact model outputs.
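The measurement described above, comparing compilation success on an original statement against its paraphrases, can be sketched as a small harness. The `formalize` and `compiles` callables are hypothetical stand-ins for an LLM autoformalizer and a Lean compilation check; this is not the paper's evaluation code.

```python
from statistics import mean

def robustness_gap(formalize, compiles, statement, paraphrases):
    """Compare compilation validity on an original NL statement vs paraphrases.

    formalize(nl) -> candidate formal code (stand-in for an LLM call);
    compiles(code) -> bool (stand-in for a Lean compiler check).
    Returns (original_ok, paraphrase_rate): success on the original and the
    fraction of paraphrases whose formalization still compiles.
    """
    original_ok = compiles(formalize(statement))
    paraphrase_rate = mean(compiles(formalize(p)) for p in paraphrases)
    return original_ok, paraphrase_rate
```

A large gap between `original_ok` and `paraphrase_rate` is the kind of paraphrase sensitivity the paper reports.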


IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch

Biyani, Param, Kirtania, Shashank, Bajpai, Yasharth, Gulwani, Sumit, Tiwari, Ashish

arXiv.org Artificial Intelligence

We introduce IndiMathBench, a human-verified benchmark designed to evaluate mathematical theorem proving, curated using an AI-powered human-assisted pipeline for formalizing natural language problems in Lean. IndiMathBench is composed of 312 formal Lean 4 theorems paired with their corresponding informal problem statements, sourced from Indian Mathematics Olympiads. Through category-based retrieval, iterative compiler feedback, and multi-model ensembles, our pipeline generates candidate formalizations that experts efficiently validate via an interactive dashboard with automated quality summaries. Evaluation across multiple frontier models demonstrates that autoformalization remains challenging, with substantial gaps between syntactic validity and semantic correctness, while theorem proving success rates remain low even with iterative refinement, demonstrating that IndiMathBench presents a challenging testbed for mathematical reasoning. IndiMathBench is available at https://github.com/prmbiy/IndiMathBench.
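The "iterative compiler feedback" step in pipelines like the one above follows a general check-repair loop. Here is a minimal sketch of that technique, with `check` and `repair` as hypothetical stand-ins for a Lean compiler invocation and an LLM repair prompt; it is not the paper's implementation.

```python
def refine_until_compiles(candidate, check, repair, max_rounds=3):
    """Generic compiler-feedback refinement loop (sketch).

    check(code) -> (ok, error_message): stand-in for compiling with Lean.
    repair(code, error) -> new candidate: stand-in for an LLM repair step.
    Returns (code, ok, rounds_used).
    """
    for rounds_used in range(max_rounds):
        ok, err = check(candidate)
        if ok:
            return candidate, True, rounds_used
        # Feed the compiler error back to the repair step and retry.
        candidate = repair(candidate, err)
    ok, _ = check(candidate)
    return candidate, ok, max_rounds
```

The loop terminates either on the first compiling candidate or after a fixed budget, which matches how such pipelines typically bound LLM and compiler calls.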



CLEVER: A Curated Benchmark for Formally Verified Code Generation

Thakur, Amitayush, Lee, Jasper, Tsoukalas, George, Sistla, Meghana, Zhao, Matthew, Zetzsche, Stefan, Durrett, Greg, Yue, Yisong, Chaudhuri, Swarat

arXiv.org Artificial Intelligence

We introduce CLEVER, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, CLEVER avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use CLEVER to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub (https://github.com/trishullab/clever) as well as HuggingFace (https://huggingface.co/datasets/amitayusht/clever). All our evaluation code is also available online (https://github.com/trishullab/clever-prover).
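The spec-then-implementation structure described above can be illustrated with a toy Lean example (not drawn from CLEVER): a specification as a `Prop` over functions, an implementation, and a theorem checked by Lean's type checker that the implementation satisfies the specification. Lemma names such as `Nat.two_mul` assume a current Lean 4 / Mathlib setup.

```lean
-- Toy illustration (not a CLEVER problem): specification, implementation,
-- and a machine-checked proof that the implementation meets the spec.
def mySpec (f : Nat → Nat) : Prop := ∀ n, f n = n + n

def myImpl (n : Nat) : Nat := 2 * n

theorem myImpl_meets_spec : mySpec myImpl := by
  intro n
  simp [myImpl, Nat.two_mul]
```

In CLEVER's setting the specification is held out, so a model must both recover an equivalent `mySpec` and produce `myImpl` with its proof.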


VERINA: Benchmarking Verifiable Code Generation

Ye, Zhe, Yan, Zhengxu, He, Jingxuan, Kasriel, Timothe, Yang, Kaiyu, Song, Dawn

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly integrated into software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often focus on only individual components rather than providing a holistic evaluation framework of all tasks. In this paper, we introduce Verina (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. Verina consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, achieves a 61.4% code correctness rate, 51.0% for specification soundness and completeness, and a mere 3.6% proof success rate (based on one trial per task). We hope Verina will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on https://huggingface.co/datasets/sunblaze-ucb/verina and our evaluation code on https://github.com/sunblaze-ucb/verina.


Lean Finder: Semantic Search for Mathlib That Understands User Intents

Lu, Jialin, Emond, Kye, Yang, Kaiyu, Chaudhuri, Swarat, Sun, Weiran, Chen, Wuyang

arXiv.org Artificial Intelligence

We present Lean Finder, a semantic search engine for Lean and Mathlib that understands and aligns with the intents of mathematicians. We further align Lean Finder with mathematicians' preferences. In addition, Lean Finder is compatible with LLM-based theorem provers, bridging retrieval with formal reasoning. Advances in Lean and Mathlib (De Moura et al., 2015; Moura & Ullrich, 2021) are turning mathematical discovery into a collaborative and verifiable research workflow. Despite these advances, state-of-the-art LLMs still cannot solve math research problems, and Lean's syntax, grammar, and tactics incur a steep learning curve. All experiments and data processing were conducted outside Meta. Figure 1: In the evaluation with user queries, real users preferred Lean Finder in 81.6% of cases. Consider the two queries below, of the kind existing Lean search engines handle (Gao et al., 2024a;b; Ju & Dong, 2025; Asher, 2025):

Query 1: Denote L/K a field extension; x, y in L are algebraic elements over K with the same minimal polynomial.

Query 2: I'm working with algebraic elements over a field extension and I have two elements, say x and y in L. I know x is algebraic over K, and I've shown that y is a root of the minimal polynomial of x. Does this imply that the minimal polynomials of x and y are actually equal?

Target Statement 2:

    theorem eq_of_root {x y : L} (hx : IsAlgebraic K x)
        (h_ev : Polynomial.aeval y (minpoly K x) = 0) :
        minpoly K y = minpoly K x := by
      sorry -- proof omitted for brevity

This user latent context (motivation, perspective, abstraction) cannot be inferred or encoded by a purely syntactic informalization. Addressing this challenge calls for Lean search engines that can understand a mathematician's intent, not merely match surface syntax. We defer a more rigorous analysis to Section 2.2. Our approach analyzes and clusters public discussions, then synthesizes queries that simulate user intents (Section 3.1).
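Once queries and Mathlib statements are embedded in a shared space, retrieval itself reduces to nearest-neighbor search. This is a minimal sketch assuming precomputed embeddings; the statement names below are made-up placeholders, and the embedding model is out of scope.

```python
import numpy as np

def semantic_search(query_vec, corpus_vecs, names, k=3):
    """Rank formal statements by cosine similarity to an embedded query.

    query_vec: (dim,) embedding of the user's NL query.
    corpus_vecs: (n, dim) embeddings of candidate Mathlib statements.
    names: display names for the candidates (placeholders here).
    Returns the top-k (name, score) pairs, best first.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per candidate
    top = np.argsort(-scores)[:k]       # indices of the k best scores
    return [(names[i], float(scores[i])) for i in top]
```

The quality of the ranking then rests entirely on whether the embedding captures user intent rather than surface syntax, which is the gap Lean Finder targets.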


ProofFlow: A Dependency Graph Approach to Faithful Proof Autoformalization

Cabral, Rafael, Do, Tuan Manh, Yu, Xuejun, Tai, Wai Ming, Feng, Zijin, Shen, Xin

arXiv.org Artificial Intelligence

Proof autoformalization, the task of translating natural language theorems and proofs into machine-verifiable code, is a critical step for integrating large language models into rigorous mathematical workflows. Current approaches focus on producing executable code, but they frequently fail to preserve the semantic meaning and logical structure of the original human-written argument. To address this, we introduce ProofFlow, a novel pipeline that treats structural fidelity as a primary objective. ProofFlow first constructs a directed acyclic graph (DAG) to map the logical dependencies between proof steps. Then, it employs a novel lemma-based approach to systematically formalize each step as an intermediate lemma, preserving the logical structure of the original argument. To facilitate evaluation, we present a new benchmark of 184 undergraduate-level problems, manually annotated with step-by-step solutions and logical dependency graphs, and introduce ProofScore, a new composite metric to evaluate syntactic correctness, semantic faithfulness, and structural fidelity. Experimental results show our pipeline sets a new state-of-the-art for autoformalization, achieving a ProofScore of 0.545, substantially exceeding baselines like full-proof formalization (0.123), which processes the entire proof at once, and step-proof formalization (0.072), which handles each step independently. Our pipeline, benchmark, and score metric are open-sourced to encourage further progress at https://github.com/Huawei-AI4Math/ProofFlow.
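The DAG-based scheduling described above, formalizing each proof step only after the steps it depends on, can be sketched with a topological sort. The dependency graph below is a made-up example shape, not data from the paper.

```python
from graphlib import TopologicalSorter

def lemma_order(dependencies):
    """Order proof steps so each is formalized after its prerequisites.

    dependencies: dict mapping each step to the set of steps it relies on,
    mirroring a ProofFlow-style DAG of logical dependencies. Raises
    graphlib.CycleError if the "DAG" actually contains a cycle.
    Returns the steps in a valid formalization order.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

Formalizing in this order lets each step become an intermediate lemma whose statement can refer to already-formalized prerequisites, which is what preserves the structure of the original argument.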